Design Simple Search Engine

Ashish

Ashish Pratap Singh

medium

In this chapter, we will explore the low-level design of a simple in-memory search engine.

Let's start by clarifying the requirements:

1. Clarifying Requirements

Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.

Here is an example of how a conversation between the candidate and the interviewer might unfold:

After gathering the details, we can summarize the key system requirements.

1.1 Functional Requirements

  • Index a predefined set of documents available in memory.
  • Support case-insensitive, keyword-based search. Return a list of documents containing the specified keyword.
  • Support basic ranking of search results (e.g., using keyword frequency within each document)
  • Provide a simple interface to input queries and display search results

1.2 Non-Functional Requirements

  • Modularity: The system should follow clean object-oriented design with well-separated responsibilities.
  • Performance: Search queries should return results quickly, even when handling large sets of documents.
  • Maintainability: The code should be easy to test, debug, and evolve over time
  • Memory Efficiency: The indexing structure should be memory-optimized to store and search documents efficiently

2. Identifying Core Entities

Core entities are the fundamental building blocks of our system. We identify them by analyzing the functional requirements and mapping the key responsibilities to object-oriented abstractions—classes, interfaces, or enums.

Let’s walk through the functional requirements and extract the relevant entities:

1. The system should index a predefined set of documents in memory.

This indicates the need for a Document entity to represent each searchable item. Each document should have a unique identifier and raw text content.

To manage all available documents, we introduce a DocumentStore entity. This serves as an in-memory container that exposes APIs to add and retrieve documents.

For efficient keyword-based retrieval, we require an InvertedIndex—a well-known data structure in search engines. It maps terms (keywords) to the list of documents that contain them, along with metadata such as frequency.

2. The system should return matching documents ranked by keyword frequency.

To support this, we define a Posting entity that represents an occurrence of a term in a document. Each posting includes:

  • Document ID
  • Term frequency (i.e., how many times the term appears in the document)

In addition, we introduce a SearchResult entity that packages:

  • A matched Document
  • A relevance score (e.g., based on term frequency)

3. The system should process queries and return ranked results.

To orchestrate the entire search pipeline, we define a SearchEngine entity.

  • It builds the inverted index using the document store.
  • It accepts queries, applies scoring and ranking strategies, and returns results.

These core entities define the essential abstractions of the in-memory search engine and will guide the structure of your low-level design and class diagrams.

3. Designing Classes and Relationships

This section outlines the classes that form the building blocks of the search engine, their responsibilities, and the relationships between them.

3.1 Class Definitions

The system is designed with a clear separation of concerns, categorized into data classes that hold information and core classes that implement the engine's logic.

Data Classes

These are simple Plain Old Java Objects (POJOs) or data containers with minimal logic.

Document

Document

Represents a single unit of information to be indexed and searched. It contains a unique id, a title, and its content.

Posting

An entry in the inverted index.

Posting

It encapsulates the documentId where a term appears and the frequency of that term within the document.

SearchResult

A container that pairs a Document with its calculated relevance score, used for ranking and display.

SearchResult

Core Classes

These classes contain the main business logic for indexing, searching, scoring, and ranking.

DocumentStore

DocumentStore

Acts as an in-memory database, mapping document IDs to Document objects for quick retrieval.

InvertedIndex

The central data structure of the engine.

InvertedIndex

It maps each term (word) to a list of Posting objects, enabling fast lookups of documents containing a specific term.

A concrete implementation of RankingStrategy that sorts by score, using the document title alphabetically as a tie-breaker.

SearchEngine

The main orchestrator class.

SearchEngine

It provides a simple public API for indexing documents and performing searches, hiding the underlying complexity of the system.

3.2 Class Relationships

The classes interact through a combination of composition, association, and dependency, creating a robust and flexible system.

Composition

The SearchEngine has a strong "owns-a" relationship with its core components.

  • SearchEngine ◆── InvertedIndex: The SearchEngine creates and manages the lifecycle of its InvertedIndex. The index cannot exist without the engine.
  • SearchEngine ◆── DocumentStore: Similarly, the DocumentStore is an integral part of the SearchEngine and is managed by it.

Aggregation

The index and store have "has-a" relationships with their data objects.

  • InvertedIndex ◇── Posting: The InvertedIndex contains a map of terms to lists of Posting objects. The postings are part of the index but represent data linked to independent documents.
  • DocumentStore ◇── Document: The DocumentStore holds a collection of Document objects, which are created externally and added to the store.

Association

The SearchEngine has a "uses-a" relationship with its strategies.

  • SearchEngine → ScoringStrategy: The SearchEngine holds a reference to a ScoringStrategy object. This allows the scoring algorithm to be changed dynamically (pluggable behavior).
  • SearchEngine → RankingStrategy: The SearchEngine also holds a reference to a RankingStrategy, allowing the ranking logic to be easily swapped.
  • SearchResult → Document: Each SearchResult is associated with the Document it represents.

Dependency

Several classes depend on others to perform their tasks, often as method parameters.

  • SearchEngine depends on Document for indexing and SearchResult for returning search results.
  • The ScoringStrategy implementations depend on Posting and Document to calculate a score.

3.3 Key Design Patterns

Several design patterns are employed to ensure the system is efficient, scalable, and maintainable.

Strategy Pattern

This pattern is used to make the scoring and ranking algorithms interchangeable.

ScoringStrategy

ScoringStrate

The ScoringStrategy interface and its concrete implementations (TermFrequencyScoringStrategy, TitleBoostScoringStrategy) allow the client to choose how documents are scored without modifying the SearchEngine.

RankingStrategy

RankingStrategy

Likewise, the RankingStrategy interface and its implementations allow the sorting logic for results to be defined and selected at runtime.

Facade Pattern

The SearchEngine class acts as a Facade. It provides a simplified, high-level interface (indexDocuments, search) to the more complex underlying subsystem of indexing, data storage, scoring, and ranking. This decouples the client from the internal workings of the search engine.

Singleton Pattern

The SearchEngine is implemented as a Singleton to ensure there is only one instance managing the index and document store for the entire application. This provides a single, global point of access and prevents inconsistencies from multiple competing instances.

3.4 Full Class Diagram

Simple Search Engine

4. Implementation

4.1 Document

Represents a unit of information indexed by the search engine.

1class Document:
2    def __init__(self, id: str, title: str, content: str):
3        self.id = id
4        self.title = title
5        self.content = content
6
7    def get_id(self) -> str:
8        return self.id
9
10    def get_title(self) -> str:
11        return self.title
12
13    def get_content(self) -> str:
14        return self.content
15
16    def __str__(self) -> str:
17        return f"Document(id={self.id}, title='{self.title}')"

Each document has:

  • A unique id
  • A title and content for search and scoring

4.2 DocumentStore

Acts as an in-memory database for documents.

1class DocumentStore:
2    def __init__(self):
3        self.store: Dict[str, Document] = {}
4
5    def add_document(self, doc: Document):
6        self.store[doc.get_id()] = doc
7
8    def get_document(self, doc_id: str) -> Optional[Document]:
9        return self.store.get(doc_id)

Supports retrieval by ID during scoring and search result generation.

4.3 Posting

Encapsulates term-specific metadata within a document. Used as entries in the inverted index.

1class Posting:
2    def __init__(self, document_id: str, frequency: int):
3        self.document_id = document_id
4        self.frequency = frequency
5
6    def get_document_id(self) -> str:
7        return self.document_id
8
9    def get_frequency(self) -> int:
10        return self.frequency
  • documentId: The document where the term appears
  • frequency: How often the term occurs (used for scoring)

4.4 InvertedIndex

Maps each term to a list of Postings.

1class InvertedIndex:
2    def __init__(self):
3        self.index: Dict[str, List[Posting]] = defaultdict(list)
4
5    def add(self, term: str, document_id: str, frequency: int):
6        postings = self.index.get(term, [])
7        postings.append(Posting(document_id, frequency))
8        self.index[term] = postings
9
10    def get_postings(self, term: str) -> List[Posting]:
11        return self.index.get(term, [])

This is the heart of the search engine that enables fast lookup of documents containing a query term.

An inverted index is the fundamental data structure that makes search engines fast. Instead of scanning every document for a query term (which would be very slow), we pre-process the documents and build a map from each term (word) to a list of documents that contain it.

4.5 SearchResult

Pairs a document with its calculated relevance score. Used for ranking and presenting the final search results to the user.

1class SearchResult:
2    def __init__(self, document: Document, score: float):
3        self.document = document
4        self.score = score
5
6    def get_document(self) -> Document:
7        return self.document
8
9    def get_score(self) -> float:
10        return self.score
11
12    def __str__(self) -> str:
13        return f"  - {self.document.get_title()} (Score: {self.score:.2f})"

4.6 Scoring Strategies

Implements the Strategy pattern for scoring.

1class ScoringStrategy(ABC):
2    @abstractmethod
3    def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
4        pass
5
6class TermFrequencyScoringStrategy(ScoringStrategy):
7    def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
8        return posting.get_frequency()
9
10class TitleBoostScoringStrategy(ScoringStrategy):
11    TITLE_BOOST_FACTOR = 1.5
12
13    def calculate_score(self, term: str, posting: Posting, document: Document) -> float:
14        score = posting.get_frequency()
15        if term in document.get_title().lower():
16            score *= self.TITLE_BOOST_FACTOR
17        return score

 The ScoringStrategy interface defines a contract for any scoring algorithm. The SearchEngine holds a reference to an object of this type. This allows us to easily switch between a simple TermFrequencyScoringStrategy and a more advanced TitleBoostScoringStrategy at runtime.

4.7 Ranking Strategies

Implements the Strategy pattern for ranking.

1class RankingStrategy(ABC):
2    @abstractmethod
3    def rank(self, results: List[SearchResult]):
4        pass
5
6class ScoreBasedRankingStrategy(RankingStrategy):
7    def rank(self, results: List[SearchResult]):
8        results.sort(key=lambda x: x.get_score(), reverse=True)
9
10class ScoreThenAlphabeticalRankingStrategy(RankingStrategy):
11    def rank(self, results: List[SearchResult]):
12        results.sort(key=lambda x: (-x.get_score(), x.get_document().get_title()))

Similar to scoring, the RankingStrategy allows us to define different ways to order the final results. The ScoreBasedRankingStrategy provides a standard relevance sort, while the ScoreThenAlphabeticalRankingStrategy shows how we can handle tie-breaking gracefully, a common requirement in real-world systems.

4.8 SearchEngine

This class acts as a central Facade and Singleton, orchestrating all the components to provide a simple API for indexing and searching.

1class SearchEngine:
2    _instance = None
3
4    def __new__(cls):
5        if cls._instance is None:
6            cls._instance = super().__new__(cls)
7        return cls._instance
8
9    def __init__(self):
10        if hasattr(self, '_initialized'):
11            return
12        self._initialized = True
13        self.inverted_index = InvertedIndex()
14        self.document_store = DocumentStore()
15        self.scoring_strategy = None
16        self.ranking_strategy = None
17
18    @classmethod
19    def get_instance(cls):
20        if cls._instance is None:
21            cls._instance = cls()
22        return cls._instance
23
24    def set_scoring_strategy(self, scoring_strategy: ScoringStrategy):
25        self.scoring_strategy = scoring_strategy
26
27    def set_ranking_strategy(self, ranking_strategy: RankingStrategy):
28        self.ranking_strategy = ranking_strategy
29
30    def index_documents(self, documents: List[Document]):
31        for doc in documents:
32            self.index_document(doc)
33
34    def index_document(self, doc: Document):
35        self.document_store.add_document(doc)
36        term_frequencies: Dict[str, int] = {}
37
38        text = (doc.get_title() + " " + doc.get_content()).lower()
39        tokens = re.split(r'\W+', text)
40
41        for token in tokens:
42            if token:
43                term_frequencies[token] = term_frequencies.get(token, 0) + 1
44
45        for term, frequency in term_frequencies.items():
46            self.inverted_index.add(term, doc.get_id(), frequency)
47
48    def search(self, query: str) -> List[SearchResult]:
49        processed_query = query.lower()
50
51        postings = self.inverted_index.get_postings(processed_query)
52
53        results = []
54        for posting in postings:
55            doc = self.document_store.get_document(posting.get_document_id())
56            if doc is not None:
57                score = self.scoring_strategy.calculate_score(processed_query, posting, doc)
58                results.append(SearchResult(doc, score))
59
60        self.ranking_strategy.rank(results)
61
62        return results
  • Singleton Pattern: The engine is a Singleton to ensure there is only one instance managing the index and document store for the entire application.
  • Facade Pattern: It provides a simple, high-level API (indexDocuments, search) that hides the underlying complexity of tokenization, index management, scoring, and ranking.
  • Indexing Process: The indexDocument method demonstrates the full pipeline: tokenizing text, counting term frequencies, and populating the InvertedIndex.
  • Search Process: The search method orchestrates the retrieval: it gets candidate documents from the index, uses the injected ScoringStrategy to calculate their relevance, and then uses the injected RankingStrategy to sort them before returning the final list.

4.9 SearchEngineDemo

This driver class shows how a client would interact with the SearchEngine and demonstrates the flexibility of the strategy-based design.

1class SearchEngineDemo:
2    @staticmethod
3    def main():
4        engine = SearchEngine.get_instance()
5
6        documents = [
7            Document("doc1", "Java Performance", "Java is a high-performance language. Tuning Java applications is key."),
8            Document("doc2", "Introduction to Python", "Python is a versatile language, great for beginners."),
9            Document("doc3", "Advanced Java Concepts", "This document covers advanced topics in Java programming."),
10            Document("doc4", "Python vs. Java", "A document comparing Python and Java for web development. Java is faster.")
11        ]
12
13        print("Indexing documents...")
14        engine.index_documents(documents)
15        print("Indexing complete.\n")
16
17        print("====== TermFrequency Scoring + ScoreBased Ranking ======")
18        engine.set_scoring_strategy(TermFrequencyScoringStrategy())
19        engine.set_ranking_strategy(ScoreBasedRankingStrategy())
20
21        SearchEngineDemo.perform_search(engine, "java")
22        SearchEngineDemo.perform_search(engine, "language")
23        SearchEngineDemo.perform_search(engine, "performance")
24
25        print("\n====== TitleBoost Scoring + Score-then-Alphabetical Ranking ======")
26        engine.set_scoring_strategy(TitleBoostScoringStrategy())
27        engine.set_ranking_strategy(ScoreThenAlphabeticalRankingStrategy())
28
29        SearchEngineDemo.perform_search(engine, "java")
30        SearchEngineDemo.perform_search(engine, "language")
31        SearchEngineDemo.perform_search(engine, "performance")
32
33        SearchEngineDemo.perform_search(engine, "paint")
34
35    @staticmethod
36    def perform_search(engine: SearchEngine, query: str):
37        print(f"--- Searching for: '{query}' ---")
38        results = engine.search(query)
39
40        if not results:
41            print("  No results found.")
42        else:
43            for i, result in enumerate(results):
44                print(f"Rank {i + 1}:{result}")
45        print()
46
47
48if __name__ == "__main__":
49    SearchEngineDemo.main()

5. Run and Test

Languages
Java
C#
Python
C++
Files13
entities
strategies
document_store.py
search_engine_demo.py
main
search_engine.py
search_engine_demo.py
Output

6. Quiz

Design Simple Search Engine - Quiz

1 / 20
Multiple Choice

Which entity is responsible for representing a single piece of content that can be searched in the simple search engine design?

How helpful was this article?

Comments


0/2000

No comments yet. Be the first to comment!

Copilot extension content script